Abstract

As processors become more powerful and clusters larger, users will
exploit this increased power to progressively run larger and larger
problems. Today's datasets in biology, physics or multimedia
applications are huge and require high-performance storage 
sub-systems. As a result, the hot spot of cluster computing is 
gradually moving from high performance computing to high performance 
storage I/O. The solutions proposed by the parallel file-system 
community try to improve performance by working at the kernel level to 
enhance the regular I/O design or by using a dedicated Storage Area 
Network like Fiber Channel. We propose a new design to merge the 
communication network and the storage network at the best price. We 
have implemented it in OPIOM with the Myrinet interconnect: OPIOM 
moves data asynchronously from SCSI disks to the embedded memory of a 
Myrinet interface in order to send it to a remote node. This design 
presents attractive features: high performance and extremely low host 
overhead. 

Keywords: Myrinet; SCSI; Storage; Disk I/O; Linux; Cluster

1 Introduction

The availability of powerful microprocessors and high-speed networks
as commodity components is making clusters an appealing solution for
cost-effective high performance computing. However, the bottleneck for
users' applications tends to shift from the computation and the
communication sides to the I/O domain: the problem sizes are bigger
and bigger and the time to load datasets into the cluster work pool
and write the results to disks cannot be neglected any more. The new
generation of commodity components like storage controllers and
high-speed networks can be efficiently used to break some
architectural limitations inherited from the past, while keeping the
price/performance ratio as low as possible. Our research effort
improves a basic feature for the usage of parallel I/O on clusters by
removing bottlenecks.

In Section 2, we present the motivation of this work, one current
limitation of the parallel I/O design for clusters, and some related
work that tries to improve it. We propose a new design in Section 3,
describing our contribution and detailing the issues that occurred
during the implementation. Then, Section 4 shows the results of some
experimental benchmarks to highlight the benefit of our work for
parallel I/O implementations. Finally, the conclusion in Section 5
summarizes our work and presents the short and medium term
perspectives of our project.

2 Motivation

Today's clusters are larger
and more powerful than ever before. They start to be used to face some
Grand Challenges in genomics or nuclear simulations, or even for
intensive multimedia applications like Video-on-Demand (VOD). The
datasets used in these contexts are very large, and require the I/O
sub-system to be as efficient as the computation or the communication
components.

There are two ways to achieve high-performance I/O in a cluster
environment:

- To use a dedicated Storage Area Network (SAN) like Fiber Channel [1]
  to connect the storage units, the SCSI disks, to all of the nodes of
  the cluster. This choice is quite expensive as clusters already
  include a high-speed interconnect to support message passing. One
  effort [2] focused on providing an MPI implementation on top of
  Fiber Channel, without real success.
- To use disks attached locally to each node of the cluster with a
  parallel file-system: the interconnect network is used to send I/O
  requests to remote nodes and to receive the data from them. It seems
  to be the solution chosen by the research community on cluster
  computing in order to preserve the cost-effective advantage of
  clusters.

There are a lot of prototypes and production-quality parallel or
distributed file-systems available worldwide like the well-known NFS
[3], GPFS for the IBM SP [4] or more recently PVFS [5]. All of these
contributions present improvements and new features on parallel disk
access, cache management, load balancing, fault tolerance, etc.

Technology is improving rapidly and the availability of efficient
storage buses such as SCSI Ultra160 or ATA-100 coupled with high-speed
networks like Myrinet raises the expectation of parallel file-system
implementations using these components. However, such parallel
file-systems use a significant amount of resources, like processor
cycles or memory on the nodes that host disks, and it is not unusual
to use dedicated nodes to act as I/O servers for the rest of the
cluster. Processing I/O requests from a remote process and serving the
requested data is still a very heavy load on today's machines.

2.1 Remote read

The Remote Read
is a basic operation of parallel file-systems. As the data is balanced
along disks on different nodes, it is necessary to send an I/O request
via the interconnect network to the remote nodes to ask them to read
some data locally before sending it back to the client. Fig. 1
illustrates the data movements in a node processing an I/O Read
request on a SCSI sub-system from a remote client via Myrinet. The
data path is particularly long and uses several hardware components
that can be possible bottlenecks.

Fig. 1. Data movements on the server side in the Remote Read operation
between a SCSI controller and a Myrinet interface.

We can distinguish three steps:

A) In response to the SCSI commands posted from the driver in the
   kernel, the disks read the data and transfer it over the SCSI bus
   to the SCSI controller, and the latter copies it to the kernel
   memory space via the PCI bus and the memory bus using a DMA engine.
B) After catching the interrupt from the SCSI controller that
   indicates the completion of the SCSI commands and the DMA copy, the
   kernel updates the buffer cache entries and copies the data into
   the user-level application memory space by crossing the memory bus
   two times and using the processor to perform the copy.
C) Finally, the data is sent to the remote node directly from the
   application memory space with the zero-copy communication protocol
   available on Myrinet. During this operation, the data crosses the
   memory bus and the PCI bus to the Myrinet embedded memory before
   being pushed on the link.

We can see in [6] that the PCI bus and the memory bus are more of a
limit on the throughput of this Remote Read operation than the SCSI
bus or the Myrinet link. In fact, each unit of data passes one time
over the SCSI bus and the Myrinet link, two times over the PCI bus and
four times over the memory bus. As the throughput of these components
tends to converge, the storage system and the interconnect are not
efficiently exploited.

2.2 Related work

Other projects have a similar
approach to improve the Remote Read operation by reducing the critical
path.

2.2.1 Linux raw I/O

The raw I/O concept tries to remove step B in Fig. 1. It enables
access to the storage devices directly from a user-space application
by bypassing the kernel space, thereby avoiding the buffer cache
overhead and saving a memory copy. Raw I/O is available as a patch to
the Linux kernel and, after strong pressure from the database
community who wanted to minimize the kernel cost and allow low-level
storage device management at the application level, will be included
in Linux kernels 2.4.x.

2.2.2 Intelligent I/O (I2O)

I2O [7] is an architecture designed to eliminate the bottlenecks by
using dedicated I/O processors that relieve the main processor of the
handling of data movements, interrupts, flow control, etc. In the I2O
model, the I/O operations are messages exchanged between I2O devices,
e.g. between an I2O SCSI controller and a dedicated i960 or directly
between an I2O SCSI controller and an I2O ATM network interface. The
I2O devices are completely autonomous and are able to support
asynchronously a large part of the I/O processing. I2O has had
difficulty gaining market acceptance, certainly due to the very high
cost of the I2O devices and the development tools, as well as the
closed nature of the specifications.

2.2.3 Network Attached Storage (NAS)

NAS is an important ongoing work. NAS differs from a SAN like Fiber
Channel by providing a file system functionality instead of a
fixed-size block-oriented interface. NAS is based on SAN for reliable
and high-performance communication support, but the rich interface of
the NAS removes the limitation of SAN in handling the storage devices
as a single hardware unit shared by all of the clients. Gibson and
Meter [8] present a broad survey of the state of the art.

3 Contribution

The data path between the storage controller and the network interface
passes through the host memory despite the fact that the data is not
processed by the main processor before being sent to the I/O request
emitter. The data goes through the host because of system constraints:
the interactions with a local storage controller are traditionally
operated from a user application and the communication interface of
the NIC assumes the data to be present in the main memory at the
beginning of a send. It would be very efficient
to drastically shorten the data path by moving data directly from the
storage controller to the network interface over the PCI bus (Fig. 2),
like the I2O peer-to-peer operations.

Fig. 2. Data movements on the server side in the Remote Read operation
with OPIOM.

It is technically possible as all of these devices provide DMA engines
and embedded memory on the PCI address space. With DMA engines, a PCI
device is able to write or read data in main memory but also at any
other PCI memory address. However, only a few devices provide enough
on-board memory and the flexibility to access this memory and use
DMAs. We have chosen to work with Myrinet interfaces for their
software openness, the possibility to modify the firmware, the amount
of embedded memory (up to 8 MB for the latest generation) and their
performance. The Alteon Gigabit Ethernet interface also presents some
good characteristics but the performance and the flexibility of the
firmware are limited. We have designed an interface to handle the
Remote Read operation with the shortest possible data path between
SCSI devices and a Myrinet interface: OPIOM.

3.1 OPIOM

OPIOM stands for Off-Processor I/O with Myrinet. OPIOM is an
implementation for Linux of the optimization previously described.
OPIOM supports all of the SCSI controllers supported by Linux and
Myrinet as the network interface. However, OPIOM is generic enough to
easily support any other network interface providing the basic
requirements: a PCI memory space and the possibility to send/receive
packets from/to buffers on this embedded memory.

The choice of Linux is obvious as we need to know how the kernel
processes I/O requests and to find the best point to insert our
code. Linux is open-source and allows us to understand the I/O code
and, if needed, to slightly modify it to serve our purpose. The
implementation described below is organized into two main parts: one
is related to the interactions with the storage controller, in our
case the SCSI controller; and a second one is dedicated to the usage
of the Myrinet network interface.

3.2 Implementation

3.2.1 SCSI side

The storage part represents the active core of OPIOM. OPIOM is
composed of a user library and a kernel module that inserts a new SCSI
service at the top of the Linux SCSI stack (Fig. 3).

Fig. 3. SCSI stack in the Linux kernel 2.2.x.

3.2.1.1 Linux SCSI stack

The design of the SCSI stack in the Linux 2.2.x kernels is
particularly elegant: its decomposition into three layers separates
the SCSI services (SD for disks, SR for CD-ROMs, ST for tapes and SG
for generic devices) from the code of low-level drivers by an
abstraction layer, the SCSI middle layer. The SG module presents
interesting capabilities: it is used to develop and test new SCSI
services before their integration into the kernel, as a user-level
application can post SCSI requests to the middle layer through it.
OPIOM is very similar to SG in the way a user-level application posts
asynchronous operations via ioctl() calls. The OPIOM module generates
the corresponding SCSI commands and integrates them into SCSI requests
that are passed to the Linux SCSI middle layer.

The SCSI requests are composed of three fields:

- The SCSI command block that contains the command sequence that will
  be interpreted by the SCSI device. It includes the SCSI operation
  code, the number of physical blocks to process, etc.
- The virtual memory address of the target buffer in the kernel space.
  The controller will copy the data to this buffer by DMA.
- A completion function, called by the interrupt handler when the SCSI
  command has completed, as notified by a hardware interrupt from the
  SCSI controller.

The SCSI middle layer merges the SCSI requests issued by all of the
SCSI services in the upper layer. It is then possible to use OPIOM to
access a disk at the same time as regular I/O. The middle layer can
also rearrange the SCSI requests using an elevator algorithm in order
to minimize the movement of the heads of the disks.
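
The reordering idea can be illustrated with a minimal sketch. It is not the code of the Linux SCSI middle layer: the `blk_request` structure, the `elevator_order` function and the one-way sweep policy are assumptions chosen for illustration, and a real elevator also has to merge adjacent requests and bound the waiting time of each request.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical request descriptor (not the kernel's structure). */
struct blk_request {
    unsigned long start_block;  /* first physical block of the request */
    unsigned long nr_blocks;    /* number of contiguous blocks */
};

/* qsort comparator: ascending start block. */
static int cmp_start_block(const void *a, const void *b)
{
    const struct blk_request *x = a;
    const struct blk_request *y = b;
    return (x->start_block > y->start_block) -
           (x->start_block < y->start_block);
}

/*
 * Reorder reqs[0..n) for a one-way sweep: requests at or beyond the
 * current head position come first in ascending order, followed by the
 * requests behind the head, also in ascending order.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated
 * (the array is then left in plain ascending order).
 */
int elevator_order(struct blk_request *reqs, size_t n, unsigned long head)
{
    size_t behind = 0;
    struct blk_request *tmp;

    if (n < 2)
        return 0;
    qsort(reqs, n, sizeof reqs[0], cmp_start_block);

    /* Count the sorted requests that lie behind the head. */
    while (behind < n && reqs[behind].start_block < head)
        behind++;
    if (behind == 0 || behind == n)
        return 0;   /* plain ascending order is already a single sweep */

    /* Rotate the array so that the requests behind the head go last. */
    tmp = malloc(behind * sizeof reqs[0]);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, reqs, behind * sizeof reqs[0]);
    memmove(reqs, reqs + behind, (n - behind) * sizeof reqs[0]);
    memcpy(reqs + (n - behind), tmp, behind * sizeof reqs[0]);
    free(tmp);
    return 0;
}
```

With pending requests starting at blocks 50, 10, 90 and 30 and the head at block 40, this ordering serves 50 and 90 on the way up and then 10 and 30, instead of seeking back and forth in arrival order.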
At the completion of a request, the middle layer checks the error
code, tries to recover from errors if necessary, and calls the
completion function associated with the request. This callback allows
OPIOM to manage SCSI requests in a fully asynchronous way.

3.2.1.2 Local file-systems

The applications manipulate file descriptors, not numbers of physical
blocks on disks. Therefore, it is necessary for the OPIOM module to
translate the application language, the file descriptors, into SCSI
commands operating on physical blocks on disks. Under Linux, the local
file-systems are integrated in a file-system abstraction layer, the
Virtual File System (VFS). The VFS provides functions to map a logical
block to a physical block and to translate a position in a file to the
reference of the physical block on the disk. OPIOM supports any
file-system compliant with the Linux VFS, even with software RAID
functionality.

We must also address the issue of fragmentation. This issue is related
to the local file-system usage: a SCSI command processes contiguous
blocks on disks; however, the data is fragmented on the medium as a
result of creation and destruction of files. OPIOM needs to decompose
a global operation on a file, called an OPIOM slot, into one or more
OPIOM requests that will be associated with SCSI requests. The
completion of an OPIOM slot means the completion of all of the OPIOM
requests composing it.

3.2.1.3 Memory address conversion

A major difficulty exists as a result of the design of the SCSI stack
in Linux: a SCSI request includes the virtual address of the target
buffer allocated in the kernel space. This
virtual address is converted in the low-level driver corresponding to
the SCSI controller into a physical address usable by the DMA engine
component. The problem is that the target buffer with OPIOM is on the
Myrinet board, not in the kernel memory space. If OPIOM posts a SCSI
request with the PCI address of the buffer in the Myrinet SRAM, the
function virt_to_phys() in the low-level driver will return a false
address somewhere in the kernel space and a DMA operation to such an
address will corrupt memory and crash the machine. We are obliged to
slightly modify the behavior of the virt_to_phys() function in order
to avoid a meaningless translation for physical addresses.

The Linux kernel maps the physical host memory into the kernel space
starting at the constant PAGE_OFFSET. The BIOS maps the memory areas
of PCI devices in the kernel address space, usually at the end of it.
The addresses of buffers on the Myrinet board will be in this PCI
area. In this context, to translate a kernel virtual address into a
physical memory address, one only needs to subtract the value
PAGE_OFFSET from the virtual address. As the virtual addresses
translated by this method are always in the kernel space, we can
extend the virt_to_phys() function:

    if (address < PAGE_OFFSET)
        return (address + PAGE_OFFSET);
    else
        return (address - PAGE_OFFSET);

We can then avoid a translation by removing the value PAGE_OFFSET from
the physical address before posting the request: the result is lower
than PAGE_OFFSET, so the extended function adds PAGE_OFFSET back and
returns the original PCI address. By this hack, we can avoid the
modification of all of the low-level SCSI drivers supported by Linux.
Thus, it is possible to transparently manage, in a fully asynchronous
way, I/O operations from SCSI disks to buffers on any PCI device. The
OPIOM kernel module that handles the interactions with the Linux SCSI
stack can be dynamically loaded. The intrusiveness is limited to the
three lines added to the virt_to_phys() function. No modifications
will be needed with the Linux kernel 2.4.x because of a change in
virt_to_phys().

3.2.2 Myrinet side

Once the data is in a buffer in the Myrinet memory, we need to be able
to send it to another node. The Myrinet interface provides the two
functionalities required:

- It embeds a large amount of memory, directly accessible from the PCI
  bus.
- It can send a packet on the link from a buffer on its own embedded
  memory.
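
Because the Myrinet memory is directly accessible from the PCI bus, a buffer on the board has a PCI address that the DMA engine of the SCSI controller can target. The address trick of Section 3.2.1.3 that makes this possible can be sketched in user space as follows. This is only an illustration under stated assumptions: the value of `PAGE_OFFSET` is the usual 32-bit x86 default, `virt_to_phys_ext` and `encode_pci_target` are hypothetical names, and the sketch assumes the PCI address is not lower than `PAGE_OFFSET` (PCI areas mapped at the end of the address space).

```c
#include <stdint.h>

/* Assumed value: default kernel base on 32-bit x86 Linux. */
#define PAGE_OFFSET 0xC0000000u

/*
 * Extended translation, mirroring the three-line change described in
 * the text. Regular kernel virtual addresses are at or above
 * PAGE_OFFSET and are translated by subtraction; lower values are
 * taken to be pre-biased PCI addresses and the bias is undone.
 */
uint32_t virt_to_phys_ext(uint32_t address)
{
    if (address < PAGE_OFFSET)
        return address + PAGE_OFFSET;
    return address - PAGE_OFFSET;
}

/*
 * Hypothetical helper: the value a caller would store in the SCSI
 * request so that the extended translation yields the PCI address of
 * a buffer on the network board. Assumes pci_address >= PAGE_OFFSET.
 */
uint32_t encode_pci_target(uint32_t pci_address)
{
    return pci_address - PAGE_OFFSET;
}
```

For example, a kernel buffer at virtual address 0xC0100000 still translates to physical address 0x00100000, while a board buffer at PCI address 0xF4000000 is posted as 0x34000000 and comes out of the translation as 0xF4000000 again.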

The second functionality, called a send buffer-in-place, has been
implemented by modifying the GM [9] firmware for Myrinet. The send
buffer-in-place is similar to a regular send without the first step,
which consists of copying the message from the host memory to the
Myrinet SRAM by DMA. GM's packets are limited to 4 KB and the header
and the body of each packet, contiguous in the Myrinet memory, are
written to the link in one DMA operation. This model does not allow
sending a packet with the header in one place and the body in another
one. As the number of memory areas to send packets is limited to 2, it
is not possible to stop the sending activity on the Myrinet card
during an OPIOM operation to copy the data directly to the right
place. A simple solution consists in copying the data from an OPIOM
buffer in the Myrinet SRAM to a send buffer in the logical structure
of the GM firmware. The next generation of GM will manipulate headers
and bodies separately, so this problem will disappear. The cost of
this extra copy depends on the speed of the Lanai, the processor
embedded on the Myrinet board. The Lanai 9 at 133 MHz can copy
word-aligned data at 266 MB/s.

GM implements a blocking system call to sleep while waiting for an
interrupt from the board to notify an event. While the process is
sleeping, it is removed from the run queue in the scheduler and
effectively does not use the CPU. This is a very important feature in
the context of parallel file-systems where computation nodes on a
cluster are also used for I/O.

4 Experimentation

We have conducted experiments with OPIOM to validate the
implementation and measure the performance gain versus the regular I/O
implementation.

4.1 Platform

The platform is composed of two Linux
boxes, one server and one client:

- Server: Dell PowerEdge 2300, Pentium II 450 MHz, 156 MB, PCI 32
  bits/33 MHz, running Linux 2.2.17. The kernel on this machine
  includes the virt_to_phys() function modification. This machine also
  hosts a storage sub-system composed of an Adaptec AIC-7890 Ultra2
  SCSI host adapter (one bus at 80 MB/s of theoretical peak
  throughput) driven by the aic7xxx low-level driver and 6 Seagate
  Ultra2 LVD SCSI disks at 7200 rpm, all of them on the unique SCSI
  bus. The disks are connected by a hot-plug SCSI backplane. The
  Myrinet interface is a PCI64B (Lanai 9 at 133 MHz) with 8 MB of
  embedded memory. The communication interface for Myrinet is GM 1.3.1
  with the send buffer-in-place addition. The file-system used on the
  disks is the default Linux ext2fs, built with a logical block size
  of 4 KB. The Myrinet card and the SCSI controller are on the same
  PCI segment without other PCI devices.
- Client: Pentium III 500 MHz, 512 MB, PCI 32 bits/33 MHz, running
  Linux 2.2.17. The Myrinet interface is a PCI64B (Lanai 9 at 133 MHz)
  with 2 MB of embedded memory. The communication interface for
  Myrinet is GM 1.3.1, without any modification, as available on
  Myricom's website.

The Myrinet interfaces are connected with an 8-port switch. Despite
the fact that these boards can use a link at 2+2 Gb/s, the switch is
not compliant with this new link speed and forces the interfaces to
use the former rate, 1.28+1.28 Gb/s.

4.2 Tests

We ran three series of tests on our platform. The first
experimentation aims to validate and evaluate the performance of the
OPIOM core interacting with the storage sub-system,
and the second uses all of the components involved in the Remote Read
operation to implement the movement of data from disks to a remote
node. The third test shows some preliminary results in the context of
a parallel video server application. For all of these tests, the
dataset is striped along the 6 disks (RAID 0) at the application
level, which provides more flexibility to investigate several striping
unit sizes without regenerating the dataset. In practice, each disk
contains a 1 GB file filled with marks and stamps used to check the
validity of the data on the client side for the copy test.

4.2.1 Local read

This test tries to read data as quickly as possible from disks into
buffers in the process memory space with the regular I/O
implementation or in the Myrinet SRAM with OPIOM. With regular I/O,
the test uses one thread per disk to simulate asynchronous reads with
the synchronous functions of the C library. Each thread reads 128 KB
of data from its corresponding disk. The activity of I/O threads is
managed by another thread that drives the test and ensures the order
of data gathering. This allows us to operate several I/O requests
concurrently and to be able to compare the results with OPIOM. With
the latter, we use two buffers of size 128 KB on the Myrinet board up
to 4 disks, and four buffers after that. There is only one process
that posts OPIOM read requests and waits for their completion. This
benchmark is useless from a practical point of view, as the data on
the Myrinet memory is not checked or used. However, it is a very
stressing benchmark for the storage sub-system. Fig. 4 illustrates the
performance difference of the local read operation with OPIOM and the
regular I/O implementation.

Fig. 4. Performance of local read with OPIOM and regular I/O.

The SCSI Ultra2 bus that can theoretically reach 80 MB/s seems to
saturate at 72 MB/s, even with more than 6 disks on the bus. The
throughput per disk of 14 MB/s is consistent with their specification.
New generation 10K rpm disks have been measured at 25 MB/s. We can see
that OPIOM and regular I/O performance are similar and linear up to 4
disks. At this point, the regular I/O implementation tends to a
maximum of 64 MB/s. However, this is not the maximum available
bandwidth, as OPIOM progresses up to 72 MB/s. Fig. 5 emphasizes the
significance of OPIOM. It shows the CPU usage monitored using top
during these local read tests.

Fig. 5. CPU usage of local read with OPIOM and regular I/O.

The processor usage with OPIOM is very low (3% maximum for 6 disks) in
contrast to the CPU utilization with the regular I/O code, which
increases linearly with the number of disks and uses almost all of the
machine with 6 disks, in the 80% range. This CPU usage is due to the
memory copy between the kernel space and the user-level application.
This test also permits us to validate the interaction with the Linux
SCSI stack, as the benchmark ran for 2 consecutive days and read more
than 10 TB of data without any corruption.

4.2.2 File copy

Pleased by these promising results, we have implemented a
high-performance copy benchmark where the data is read from disks and
streamed to a remote node via Myrinet. The test code is based on a
pull architecture for
flow control reasons. Thus, the next piece of data is not sent to the
client before the reception of a small message from this client
indicating that the buffer is ready to proceed. We used the same code
to read the data as in the previous test and the client side uses two
128 KB buffers to pipeline the reception of the packets from the
server. Thus, the granularity along the pipeline is 128 KB. Fig. 6
shows very distinctly an improvement using OPIOM compared to the
performance using regular I/O.

Fig. 6. Performance of copy with OPIOM and regular I/O.

We can also see that the aggregate throughput with regular I/O is
capped at 45 MB/s: this limitation is certainly due to the PCI bus at
90 MB/s (120 MB/s half-duplex) minus some bus arbitration overhead, as
the data crosses this bus twice. Because its data crosses the PCI bus
only once, OPIOM is not sensitive to this limitation and can exploit
the maximum of the storage system. The CPU load results in Fig. 7
confirm the very low host overhead of OPIOM, less than 3% of a Pentium
II at 450 MHz to read data from disks and send it to a remote node at
the rate of 72 MB/s. In this case, the only limitation comes from the
SCSI bus. The CPU load measured during the test with the regular I/O
implementation is not as high as in Fig. 5 because the throughput is
reduced by the PCI bus and so the processor does not have to copy as
much data.

Fig. 7. CPU usage of copy with OPIOM and regular I/O.

4.2.3 Video server prototype

This test aims to show the interest of OPIOM in real applications
where the read patterns are non-linear. In this case, the displacement
of the heads of the disks is time consuming and the read-ahead
algorithm of storage units is much less efficient. We use in this
experiment a prototype of a parallel video server as described in [6]:
a node serves a video client by requesting data from all of the other
nodes, re-assembling the stream before sending it via the distribution
network. The different video streams are non-contiguous in the local
file-systems. In our experiments, only the server node has the
requested data. We measure the latency between a request and the
delivery of the requested data by the server node (transactions of 128
KB). We can see in Fig. 8 that the latency of the read requests
increases with the number of streams, as expected.

Fig. 8. Latency and jitter for 128 KB block with the video server
prototype using OPIOM and regular I/O.

However, OPIOM seems to support the load increase fairly well and the
variation of this latency stays very small. On the contrary, the
regular I/O implementation presents unstable performance, with a very
large range of variation and a quick degradation of the latency of the
read requests under load increase.
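
As an illustration of how such latency and jitter figures can be derived from per-transaction timings, here is a minimal sketch. It is not the measurement code used for the experiments: the `latency_stats` structure and the `latency_summary` function are hypothetical, and jitter is assumed here to mean the range between the smallest and the largest observed latency.

```c
#include <stddef.h>

/* Summary of a series of request latencies, in milliseconds. */
struct latency_stats {
    double mean_ms;
    double min_ms;
    double max_ms;
    double jitter_ms;   /* assumed definition: max_ms - min_ms */
};

/*
 * Summarize n latency samples. Returns 0 on success, or -1 when there
 * is nothing to summarize or an argument is missing.
 */
int latency_summary(const double *samples_ms, size_t n,
                    struct latency_stats *out)
{
    double sum, lo, hi;
    size_t i;

    if (samples_ms == NULL || out == NULL || n == 0)
        return -1;

    sum = lo = hi = samples_ms[0];
    for (i = 1; i < n; i++) {
        sum += samples_ms[i];
        if (samples_ms[i] < lo)
            lo = samples_ms[i];
        if (samples_ms[i] > hi)
            hi = samples_ms[i];
    }

    out->mean_ms = sum / (double)n;
    out->min_ms = lo;
    out->max_ms = hi;
    out->jitter_ms = hi - lo;
    return 0;
}
```

A small jitter relative to the mean corresponds to the stable behavior described for OPIOM, while a jitter comparable to or larger than the mean corresponds to the unstable behavior described for regular I/O.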
This bad behavior requires additional buffers and limits the number of
streams concurrently supported, degrading the price/performance ratio
of the video server.

5 Conclusions and perspectives

Parallel I/O is a very important research domain for high performance
cluster computing. Today's clusters cannot be used at the maximum of
their capacity because of disappointing I/O performance compared to
computational power. We have designed a basic interface to optimize
the data movement between disks and an intelligent network interface
with Linux.
Our implementation with SCSI and Myrinet, OPIOM, provides 
high-performance throughput and very low host overhead as well as a 
UNIX-like transparent I/O library. This tool can be used as a basic 
Remote Read functionality for parallel file-systems or MPI-I/O 
implementations where the I/O nodes and the computation nodes can be 
the same and where the communication interconnect can be used as a
SAN, reducing the total cost of clusters. We plan to extend our
experimentation with higher-end machines, 64 bits/66 MHz PCI bus and
Ultra3 SCSI bus, in order to saturate the resources that limit the 
regular I/O model, the memory bus and the processor. The next 
development around OPIOM will provide access to IDE disks, as the way 
Linux handles them is very similar to the SCSI devices. An emulation 
mode is already present in the Linux kernel to handle IDE controllers 
via the SCSI stack. The Write operation with OPIOM is also fairly easy 
to implement as the data path is the same as for the Read operation, 
except that the DMA engine of the SCSI controller would read from the 
Myrinet board to the SCSI device. A buffer cache invalidation 
mechanism would be needed in this case, as the consistency cannot be 
guaranteed and the data path avoids the host. We also plan to 
integrate OPIOM into the next GM releases, with some dynamic Myrinet 
buffer management and an optimized OPIOM send/receive operation.